Using the data collected from existing customers, build a model that helps the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio.
#Importing all the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#to ignore warnings
import warnings
warnings.filterwarnings("ignore")
#loading the bank-full data that is provided in CSV format
bank_data = pd.read_csv("bank-full.csv")
bank_data.head()
#checking the data types
bank_data.dtypes
#shape of data
bank_data.shape
The dataset contains 45,211 rows and 17 columns.
#getting info of the dataframe
bank_data.info()
We can see there are two different data types, object and int, across the columns, and none of them contain null values.
#Checking the missing values
bank_data.isna().sum()
We can see there are no missing values in any of the columns, but from the observations above there are many "unknown" and "other" entries, which can be treated as missing values.
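As a quick sketch of that idea (on a toy frame, since the real columns aren't reproduced here), placeholder values such as "unknown" and "other" can be counted per column to decide whether to drop a few rows or a whole column:

```python
import pandas as pd

# Toy frame standing in for bank_data; the real column names and values differ.
df = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "unknown"],
    "education": ["primary", "secondary", "unknown", "tertiary"],
})

# Count placeholder values per column: few placeholders -> drop rows,
# many placeholders -> consider dropping the whole column.
placeholder_counts = df.isin(["unknown", "other"]).sum()
print(placeholder_counts)
```

On the real data this reproduces the per-column "unknown" counts seen in the value-count listings below.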
#Checking the description of the individual attributes
bank_data.describe(include = "all").T
#unique values of the object columns in the data, excluding int and float columns
obj_columns = bank_data.select_dtypes(exclude = ["int64"]).columns.tolist()
for cols in obj_columns:
    print(f"Unique values of {cols} are {bank_data[cols].unique()}")
#unique values in each columns
bank_data.apply(lambda x: len(x.unique()))
#listing the number of value counts of each object columns
for column in obj_columns:
    print("\n" + column)
    print(bank_data[column].value_counts())
#Before dropping any rows or columns, let's copy the original dataset and perform all the analysis on the copy
bank_df1 = bank_data.copy()
bank_df1.head()
#We can see from the analysis above that the job column has 288 "unknown" rows with no category
#let's drop those "unknown" rows from the dataset
bank_df1.drop(bank_df1[bank_df1.job == "unknown"].index, axis = 0, inplace = True)
bank_df1.drop(bank_df1[bank_df1.job == "unknown"].index, axis = 0, inplace = True)
bank_df1.job.value_counts()
#Similarly, there are 1857 "unknown" rows in the education column, so we can drop those rows as well
bank_df1.drop(bank_df1[bank_df1.education == "unknown"].index, axis = 0, inplace = True)
bank_df1.education.value_counts()
#There is a very high number of "unknown" rows (29285) in the contact column,
#so we can drop the entire contact column
bank_df1.drop("contact", axis = 1, inplace = True)
#Similarly, there is a very high number of "unknown" entries (36959) in the poutcome column,
#so it's better to drop the entire poutcome column
bank_df1.drop("poutcome", axis = 1, inplace = True)
bank_df1.describe().T
- The balance column has a minimum value of -8019, which can be treated as either a typo or an outlier, since an average annual balance shouldn't be this negative.
- There are numerous entries of -1 in pdays and 0 in previous. Dropping these columns shouldn't affect the analysis.
# Age
plt.figure(figsize = (10, 6))
sns.boxplot(bank_df1["age"])
plt.show()
By looking at the boxplot, we can observe there are some outliers in the age column.
#Calculating the upper outlier fence (Q3 + 1.5*IQR) of the age column
age_q1 = bank_df1["age"].quantile(q = 0.25)
age_q3 = bank_df1["age"].quantile(q = 0.75)
print("The outliers are above", age_q3 + 1.5*(age_q3 - age_q1), "years of age")
#calculating the percentage of outliers in age column
age_outliers = bank_df1[bank_df1["age"] > 70.5]["age"].count()
total_clients_age = len(bank_df1)
print("Percentage outliers = ", round(age_outliers/total_clients_age * 100, 2), "%")
As we can see, only about 1% of the age values are outliers, which is very small, so we can fit the model with or without the age column.
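Since the same fence computation recurs for each numeric column, it can be factored into a small helper. A minimal sketch (`iqr_fences` is a hypothetical name, not from the original notebook), demonstrated on a toy series:

```python
import pandas as pd

def iqr_fences(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy ages; on the real data this reproduces the 70.5-year upper fence.
ages = pd.Series([20, 30, 33, 39, 48, 58, 95])
low, high = iqr_fences(ages)
outlier_pct = (ages > high).mean() * 100
print(f"fences: ({low}, {high}), share above upper fence: {outlier_pct:.1f}%")
```

The same helper can then be reused for the balance column, avoiding the copy-paste of quantile expressions between cells.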
#balance
plt.figure(figsize = (10, 6))
sns.boxplot(bank_df1["balance"])
plt.show()
#Calculating the upper outlier fence (Q3 + 1.5*IQR) of the balance column
balance_q1 = bank_df1["balance"].quantile(q = 0.25)
balance_q3 = bank_df1["balance"].quantile(q = 0.75)
balance_upper = balance_q3 + 1.5*(balance_q3 - balance_q1)
print("The outliers are above", balance_upper)
#calculating the percentage of outliers in the balance column
balance_outliers = bank_df1[bank_df1["balance"] > balance_upper]["balance"].count()
total_clients_balance = len(bank_df1)
print("Percentage outliers = ", round(balance_outliers/total_clients_balance * 100, 2), "%")
The balance column contains a substantial share of outliers, so we trim the extreme values using z-scores below.
from scipy.stats import zscore
bank_df1["balance_outliers"] = zscore(bank_df1["balance"])
bank_df1.drop(bank_df1[(bank_df1["balance_outliers"] > 3) | (bank_df1["balance_outliers"] < -3)].index,axis=0,inplace=True)
# We don't need the zscore column anymore
bank_df1.drop("balance_outliers",axis=1,inplace=True)
plt.figure(figsize = (10, 6))
sns.boxplot(bank_df1["balance"])
plt.show()
We can still see some outliers on the negative side, so let's drop those too.
bank_df1.drop(bank_df1[bank_df1.balance < -2500].index, axis=0, inplace=True)
plt.figure(figsize = (10, 6))
sns.boxplot(bank_df1["balance"])
plt.show()
#checking the outliers in day column
plt.figure(figsize = (10, 6))
sns.boxplot(bank_df1["day"])
plt.show()
The box plot shows no outliers in the day column.
#checking the outliers in duration column
plt.figure(figsize = (10, 6))
sns.boxplot(bank_df1["duration"])
plt.show()
We can see there are outliers in the duration column. However, the call duration is not known before a call is made, and by the time it is known, so is the outcome. We can therefore drop the entire duration column rather than worry about its outliers.
bank_df1.drop("duration",axis=1,inplace=True)
#checking the outliers in pdays column
plt.figure(figsize = (10, 6))
sns.boxplot(bank_df1["pdays"])
plt.show()
Since most entries in the pdays column hold the sentinel value -1, it's better to drop the pdays column for further analysis.
bank_df1.drop("pdays",axis=1,inplace=True)
#checking the distribution and outliers in previous column
plt.figure(figsize=(10, 6))
plt.subplot(1,2,1)
sns.histplot(bank_df1["previous"], kde=True)
plt.subplot(1,2,2)
sns.boxplot(bank_df1["previous"])
plt.show()
A large number of entries in the "previous" column are 0, so we can drop the previous column.
bank_df1.drop("previous",axis=1,inplace=True)
bank_df1.shape
bank_df1.head()
#Relationship between job and target columns
plt.figure(figsize=(20, 6))
sns.countplot(x = "job", hue = "Target", data = bank_df1)
plt.show()
We can see a large number of subscribers have management and technician job titles, whereas very few subscribers are entrepreneurs or housemaids.
#relationship between marital and target columns
plt.figure(figsize=(10, 6))
sns.countplot(x = "marital", hue = "Target", data = bank_df1)
plt.show()
From the above plot we can conclude that married and single clients are more likely to subscribe than divorced ones.
#relation between month and balance with target column
plt.figure(figsize=(20, 6))
sns.boxplot(x="month", y="balance", hue="Target",data=bank_df1)
plt.show()
From the box plot of balance by month and target, we can observe that people with a higher balance are more likely to subscribe. We can also observe a higher number of subscribers in March, September, October and November.
#relation between education and target column
plt.figure(figsize=(10, 6))
sns.countplot(x = "education", hue = "Target", data = bank_df1)
plt.show()
Clients with secondary education form the largest group of subscribers, followed by tertiary and primary.
#relation between default and target column
plt.figure(figsize=(10, 6))
sns.countplot(x = "default", hue = "Target", data = bank_df1)
plt.show()
Very few clients have credit in default, and clients without defaults tend to subscribe more than those with them.
#relation between housing and target column
plt.figure(figsize=(10, 6))
sns.countplot(x = "housing", hue = "Target", data = bank_df1)
plt.show()
More clients have a housing loan, yet they subscribe less than clients who don't have one.
#relation between loan and target column
plt.figure(figsize=(10, 6))
sns.countplot(x = "loan", hue = "Target", data = bank_df1)
plt.show()
Most clients don't have a personal loan, and they subscribe more than clients who do.
#relation between day and target column
plt.figure(figsize=(15, 6))
sns.countplot(x = "day", hue = "Target", data = bank_df1)
plt.show()
Here we can't establish any solid relation between a particular day and the number of subscribers.
#relation between campaign and target column
plt.figure(figsize=(15, 6))
sns.countplot(x = "campaign", hue = "Target", data = bank_df1)
plt.show()
Here we can observe that clients who were contacted fewer times subscribe at a higher rate.
#relation between month and target column
plt.figure(figsize=(10, 6))
sns.countplot(x = "month", hue = "Target", data = bank_df1)
plt.show()
There are more subscribers in the five consecutive months from April to August than in the rest of the year.
# let's plot the pairplot and check the relation between multiple numerical variables columns with Target column.
sns.pairplot(bank_df1[["age", "balance", "day", "campaign", "Target"]], hue = "Target")
#let's create the dummy variables for job and drop the original job column
bank_df1 = pd.concat([bank_df1, pd.get_dummies(bank_df1.job,drop_first=True)], axis=1).drop("job",axis=1)
#let's create the dummy variables for marital and drop the original marital column
bank_df1 = pd.concat([bank_df1, pd.get_dummies(bank_df1.marital,drop_first=True)], axis=1).drop("marital",axis = 1)
#let's create the dummy variables for education and drop the original education column
bank_df1 = pd.concat([bank_df1, pd.get_dummies(bank_df1.education,drop_first=True)], axis=1).drop("education",axis=1)
#let's create the dummy variables for month and drop the original month column
bank_df1 = pd.concat([bank_df1, pd.get_dummies(bank_df1.month, drop_first = True)], axis = 1).drop("month",axis = 1)
#Replace values yes with 1 and no with 0 in rest of the categorical columns
#default
bank_df1.default = bank_df1.default.map({"yes":1,"no":0})
#housing
bank_df1.housing = bank_df1.housing.map({"yes":1,"no":0})
#loan
bank_df1.loan = bank_df1.loan.map({"yes":1,"no":0})
#Target
bank_df1.Target = bank_df1.Target.map({"yes":1,"no":0})
bank_df1.head()
bank_df1.shape
#let's import the necessary libraries from sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
# Separating dependent and independent variables
X = bank_df1.drop(['Target'], axis = 1)
y = bank_df1['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
X_train.shape, X_test.shape
y_train.shape, y_test.shape
# To check the effect of scaling on the dataset, we create another copy of the data with a scaling function applied to it
from sklearn import preprocessing
SX = preprocessing.scale(X)
SX_train, SX_test, y_train, y_test = train_test_split(SX, y, test_size = 0.3, random_state = 42)
SX_train.shape, SX_test.shape
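One caveat worth noting: `preprocessing.scale` standardizes using statistics of the whole dataset, so test-set information leaks into the scaling. A leak-free variant (a sketch on synthetic data, since the processed frame isn't reproduced here) fits a `StandardScaler` on the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))        # stand-in for the feature matrix
y_demo = rng.integers(0, 2, size=100)     # stand-in for the Target column

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(Xtr)        # statistics from training data only
Xtr_s, Xte_s = scaler.transform(Xtr), scaler.transform(Xte)

# Training columns are now (approximately) zero-mean, unit-variance;
# the test split is transformed with the training statistics.
print(Xtr_s.mean(axis=0).round(6), Xtr_s.std(axis=0).round(6))
```

With a notebook-sized dataset the difference is usually small, but the train-only fit is the pattern that generalizes to production pipelines.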
# Importing libraries from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
model_a = LogisticRegression()
# Applying training data to the logistic regression model
model_a.fit(X_train, y_train)
# Predicting the test results
y_predict = model_a.predict(X_test)
# Calculating the coefficients of the logistic regression model
t = list(X_train.columns)
coef_df = pd.DataFrame(model_a.coef_, columns= t)
coef_df["intercept"] = model_a.intercept_
print(coef_df)
# Calculating the model score and print confusion matrix
model_a_score = model_a.score(X_test, y_test)
print(model_a_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
print("Test score: {}".format(model_a.score(X_test, y_test) * 100))
model_b = LogisticRegression()
model_b.fit(SX_train, y_train)
# Predicting the test results
y_predict = model_b.predict(SX_test)
# Calculating the coefficients of logistic regression model
coef_df = pd.DataFrame(model_b.coef_, columns= t)
coef_df["intercept"] = model_b.intercept_
print(coef_df)
#Calculating the model score and print confusion matrix
model_b_score = model_b.score(SX_test, y_test)
print(model_b_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
print("Test score: {}".format(model_b.score(SX_test, y_test) * 100))
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
from os import system
dTree = DecisionTreeClassifier(criterion = 'gini', random_state =42)
dTree.fit(X_train, y_train)
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
train_char_label = ['No', 'Yes']
Bankdata_Tree_File = open('bankdata_tree.dot','w')
dot_data = tree.export_graphviz(dTree, out_file = Bankdata_Tree_File, feature_names = list(X_train), class_names = list(train_char_label))
Bankdata_Tree_File.close()
retCode = system("dot -Tpng bankdata_tree.dot -o bankdata_tree.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("bankdata_tree.png"))
# Reducing over fitting (Regularization)
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=42)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
train_char_label = ['No', 'Yes']
Bankdata_Tree_FileR = open('bankdata_treeR.dot','w')
dot_data = tree.export_graphviz(dTreeR, out_file=Bankdata_Tree_FileR, feature_names = list(X_train), class_names = list(train_char_label))
Bankdata_Tree_FileR.close()
#Works only if the "dot" command works on your machine
retCode = system("dot -Tpng bankdata_treeR.dot -o bankdata_treeR.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("bankdata_treeR.png"))
# importance of features in the tree building (the importance of a feature is computed as the
# normalized total reduction of the criterion brought by that feature)
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Imp"], index = X_train.columns))
print(dTreeR.score(X_test , y_test))
y_predict = dTreeR.predict(X_test)
#Visualizing the confusion matrix
cm=metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
dTreeRS = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=42)
dTreeRS.fit(SX_train, y_train)
print(dTreeRS.score(SX_train, y_train))
print(dTreeRS.score(SX_test, y_test))
#importing required libraray for Bagging from sklearn
from sklearn.ensemble import BaggingClassifier
bg_cl = BaggingClassifier(base_estimator = dTree, n_estimators = 50,random_state = 42)
bg_cl = bg_cl.fit(X_train, y_train)
# Predicting the test results
y_predict = bg_cl.predict(X_test)
# Calculating the model score and print confusion matrix
bg_cl_score = bg_cl.score(X_test, y_test)
print(bg_cl_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
#visualizing the confusion matrix
cm=metrics.confusion_matrix(y_test, y_predict,labels = [0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot = True ,fmt = 'g')
bgs_cl = BaggingClassifier(base_estimator = dTree, n_estimators = 50,random_state = 42)
bgs_cl = bgs_cl.fit(SX_train, y_train)
# Predicting the test results
y_predict = bgs_cl.predict(SX_test)
bgs_cl_score = bgs_cl.score(SX_test, y_test)
print(bgs_cl_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
from sklearn.ensemble import AdaBoostClassifier
ab_cl = AdaBoostClassifier(n_estimators = 10, random_state = 42)
ab_cl = ab_cl.fit(X_train, y_train)
# Predicting the test results
y_predict = ab_cl.predict(X_test)
# Calculating the model score and print confusion matrix
ab_cl_score = ab_cl.score(X_test, y_test)
print(ab_cl_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
#visualizing the confusion matrix
cm=metrics.confusion_matrix(y_test, y_predict,labels = [0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot = True ,fmt = 'g')
abs_cl = AdaBoostClassifier(n_estimators = 10, random_state = 42)
abs_cl = abs_cl.fit(SX_train, y_train)
# Predicting the test results
y_predict = abs_cl.predict(SX_test)
# Calculating the model score and print confusion matrix
abs_cl_score = abs_cl.score(SX_test, y_test)
print(abs_cl_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
from sklearn.ensemble import GradientBoostingClassifier
gb_cl = GradientBoostingClassifier(n_estimators = 50,random_state = 42)
gb_cl = gb_cl.fit(X_train, y_train)
# Predicting the test results
y_predict = gb_cl.predict(X_test)
# Calculating the model score and print confusion matrix
gb_cl_score = gb_cl.score(X_test, y_test)
print(gb_cl_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
#visualizing the confusion matrix
cm=metrics.confusion_matrix(y_test, y_predict,labels = [0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot = True ,fmt = 'g')
gbs_cl = GradientBoostingClassifier(n_estimators = 50,random_state = 42)
gbs_cl = gbs_cl.fit(SX_train, y_train)
# Predicting the test results
y_predict = gbs_cl.predict(SX_test)
# Calculating the model score and print confusion matrix
gbs_cl_score = gbs_cl.score(SX_test, y_test)
print(gbs_cl_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
from sklearn.ensemble import RandomForestClassifier
rf_cl = RandomForestClassifier(n_estimators = 50, random_state = 1, max_features = 12)
rf_cl = rf_cl.fit(X_train, y_train)
# Predicting the test results
y_predict = rf_cl.predict(X_test)
# Calculating the model score and print confusion matrix
rf_cl_score = rf_cl.score(X_test, y_test)
print(rf_cl_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
#visualizing the confusion matrix
cm=metrics.confusion_matrix(y_test, y_predict,labels = [0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot = True ,fmt = 'g')
rfs_cl = RandomForestClassifier(n_estimators = 50, random_state = 1, max_features = 12)
rfs_cl = rfs_cl.fit(SX_train, y_train)
# Predicting the test results
y_predict = rfs_cl.predict(SX_test)
# Calculating the model score and print confusion matrix
rfs_cl_score = rfs_cl.score(SX_test, y_test)
print(rfs_cl_score)
print(metrics.confusion_matrix(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
modelsA = pd.DataFrame({
'Models': ["Logistic regression", "Decision Tree", "Bagging",
"AdaBoosting", "Gradient boosting", "RandomForest Classifier"],
'TestScore': [model_a_score, dTreeR.score(X_test , y_test),
bg_cl_score, ab_cl_score, gb_cl_score, rf_cl_score]})
modelsA.sort_values(by='TestScore', ascending=False)
modelsB = pd.DataFrame({
'Models': ["Scaled Logistic regression", "Scaled Decision Tree", "Scaled Bagging",
"Scaled AdaBoosting", "Scaled Gradient boosting", "Scaled RandomForest Classifier"],
'TestScore': [model_b_score, dTreeRS.score(SX_test, y_test),
bgs_cl_score, abs_cl_score, gbs_cl_score, rfs_cl_score]})
modelsB.sort_values(by='TestScore', ascending=False)
For this problem I have used six different models to predict the outcome: Logistic Regression, Decision Tree, and the ensemble techniques Bagging, AdaBoost, Gradient Boosting and Random Forest Classifier, and compared their metrics against each other.
I have also applied the scaled dataset to the same models to check whether scaling improves the predictions.
Among all the models, the Random Forest Classifier (an ensemble technique) predicts the best outcome, as its test score (0.887389) is the highest of all the models.
The Random Forest Classifier is followed by Bagging (0.885663), Decision Tree (0.884172), Gradient Boosting (0.884093), Logistic Regression (0.883936) and AdaBoost (0.88288).
Even after scaling the dataset, the Random Forest Classifier remains the best model for predicting the outcome, followed by Bagging, Decision Tree, Logistic Regression, Gradient Boosting and AdaBoost.
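Note that `cross_val_score` and `KFold` were imported earlier but never used; since the model test scores above differ only in the third decimal, a k-fold comparison would give a less split-dependent ranking. A minimal sketch (on synthetic data, since the processed frame isn't reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the processed bank data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Compare mean accuracy (and its spread) across 5 folds per model.
for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("RandomForest", RandomForestClassifier(n_estimators=50, random_state=42))]:
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A model whose mean score beats another's by less than the fold-to-fold spread should be treated as statistically tied rather than "the best".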